arXiv:1504.07851v5 [cs.DS] 16 Sep 2016


Dynamic Relative Compression, Dynamic Partial Sums, and Substring Concatenation*

Philip Bille Patrick Hagge Cording Inge Li Gørtz
Frederik Rye Skjoldjensen Hjalte Wedel Vildhøj Søren Vind

Technical University of Denmark 


Abstract 

Given a static reference string R and a source string S, a relative compression of S 
with respect to R is an encoding of S as a sequence of references to substrings of R.
Relative compression schemes are a classic model of compression and have recently proved 
very successful for compressing highly-repetitive massive data sets such as genomes and 
web-data. We initiate the study of relative compression in a dynamic setting where the 
compressed source string S is subject to edit operations. The goal is to maintain the 
compressed representation compactly, while supporting edits and allowing efficient random 
access to the (uncompressed) source string. We present new data structures that achieve 
optimal time for updates and queries while using space linear in the size of the optimal 
relative compression, for nearly all combinations of parameters. We also present solutions 
for restricted and extended sets of updates. To achieve these results, we revisit the dynamic 
partial sums problem and the substring concatenation problem. We present new optimal or 
near optimal bounds for these problems. Plugging in our new results we also immediately 
obtain new bounds for the string indexing for patterns with wildcards problem and the 
dynamic text and static pattern matching problem. 


1 Introduction 

Given a static reference string R and a source string S, a relative compression of S with 
respect to R is an encoding of S as a sequence of references to substrings of R. Relative 
compression (or external macro compression) is a classic model of compression defined by 
Storer and Szymanski [38, 39] in 1978 and has since been used in a wide range of compression
scenarios [5, 9, 21, 26, 27, 29, 30]. To compress massive highly-repetitive data sets, such as
biological sequences and web collections, relative compression has been shown to be very
practical [21, 26, 27].

Relative compression is often applied to compress multiple similar source strings. In such 
settings relative compression is superior to compressing the source strings individually. For 
instance, human genomes are 99% similar and hence relative compression might be used to 
compress a large collection of sequenced genomes using, e.g., the human reference genome as 
the static reference string. We focus on the case of compressing a single source string, but our 
results trivially generalize to compressing multiple source strings. 

*An extended abstract appeared in the proceedings of the 27th International Symposium on Algorithms
and Computation (ISAAC).

In this paper we initiate the study of relative compression in a dynamic setting, where the
compressed source string S is subject to edit operations (insertions, deletions, and replacements
of single characters). The goal is to maintain the compressed representation compactly,
while supporting edits and allowing efficient random access to the (uncompressed) source
string. Efficient data structures supporting these operations allow us to avoid costly recompression
of massive data sets after updates.

We provide the first non-trivial bounds for this problem. We present new data structures 
that achieve optimal time for updates and queries while using space linear in the size of 
the optimal relative compression, for nearly all combinations of parameters. We also present 
solutions for restricted and extended sets of updates. 

To achieve these results, we revisit the dynamic partial sums problem and the substring 
concatenation problem. We present new optimal or near optimal bounds for both of these 
problems (see detailed discussion below). Furthermore, plugging in our new results immediately
leads to new bounds for the string indexing for patterns with wildcards problem [Hf28l
and the dynamic text and static pattern matching problem [2].

1.1 Dynamic Relative Compression 

Given a reference string R and a source string S, a relative compression of S with respect 
to R is a sequence C = ((i_1, j_1), ..., (i_{|C|}, j_{|C|})) such that S = R[i_1, j_1] ··· R[i_{|C|}, j_{|C|}]. We call
C a substring cover for S. The substring cover is optimal if |C| is minimum over all relative
compressions of S with respect to R. The dynamic relative compression problem is to maintain 
a relative compression of S under the following operations. Let i be a position in S and a be 
a character. 

access(i): return the character S[i],
replace(i, a): change S[i] to character a,
insert(i, a): insert character a before position i in S,
delete(i): delete the character at position i in S.

Note that operations insert and delete change the length of S by a single character. In all
bounds below, the access(i) operation extends to decompressing an arbitrary substring of
length ℓ using only O(ℓ) additional time.
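
As a reference point for the four operations, the following minimal sketch spells out their semantics on the uncompressed string. It is only a naive specification model, not the data structure of this paper; the function names and the example strings are assumptions chosen for illustration, and positions are 1-indexed as in the text.

```python
def decode(R, cover):
    """Expand a cover, given as 1-indexed inclusive pairs (i, j), into S."""
    return "".join(R[i - 1:j] for i, j in cover)

def replace(S, i, a):
    """replace(i, a): S[i] becomes a; the length of S is unchanged."""
    return S[:i - 1] + a + S[i:]

def insert(S, i, a):
    """insert(i, a): a is placed before position i; S grows by one."""
    return S[:i - 1] + a + S[i - 1:]

def delete(S, i):
    """delete(i): the character at position i is removed; S shrinks by one."""
    return S[:i - 1] + S[i:]

R = "abracadabra"
cover = [(1, 4), (8, 11)]   # R[1,4] = "abra" and R[8,11] = "abra"
S = decode(R, cover)        # S = "abraabra"
```

Each of these naive operations takes time linear in |S|; the point of the paper is to support them on the cover itself without ever materializing S.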

Our Results Throughout the paper, let r be the length of the reference string R, N be the 
length of the (uncompressed) string S, and n be the size of an optimal relative compression of 
S with regards to R. All of the bounds mentioned below and presented in this paper hold for a 
standard unit-cost RAM with w-bit words with standard arithmetic and logical operations on
a word. This means that the algorithms can be implemented directly in standard imperative
programming languages such as C [25] or C++ [40]. An index into R or S can be stored in a
single word and hence w ≥ log(n + r).

Theorem 1. Let R and S be a reference and source string of lengths r and N, respectively,
and let n be the length of the optimal substring cover of S by R. Then, we can solve the
dynamic relative compression problem supporting access, replace, insert, and delete

(i) in O(n + r) space and O(log n / log log n + log log r) time per operation, or

(ii) in O(n + r log^ε r) space and O(log n / log log n) time per operation, for any constant ε > 0.

These are the first non-trivial bounds for the problem. Together, the bounds are optimal for
most natural parameter combinations. In particular, any data structure for a string of length
N supporting access, insert, and delete must use Ω(log N / log log N) time in the worst case
regardless of the space [HJ (this is called the list representation problem). Since n ≤ N,
we can view O(log n / log log n) as a compressed version of the optimal time bound that is
always O(log N / log log N) and better when S is compressible. Hence, Theorem 1(i) provides
a linear-space solution that achieves the compressed time bound except for an O(log log r)
additive term. Note that whenever n ≥ (log r)^{log^ε log r}, for any ε > 0, the log n / log log n term
dominates the query time and we match the compressed time bound. Hence, Theorem 1(i)
is only suboptimal in the special case when n is almost exponentially smaller than r. In this
case, we can use Theorem 1(ii) which always provides a solution achieving the compressed
time bound at the cost of increasing the space to O(n + r log^ε r).

We note that dynamic compression under different models of compression has been studied 
extensively [TIHT3|rT8l[2l.32,37j. However, all of these results require space dependent on the 
size of the original string and hence cannot take full advantage of highly-repetitive data. 

1.2 Dynamic Partial Sums 

The partial sums problem is to maintain an array Z[1..s] under the following operations.

sum(i): return Σ_{j=1}^{i} Z[j],
update(i, Δ): set Z[i] = Z[i] + Δ,
search(t): return 1 ≤ i ≤ s such that sum(i - 1) < t ≤ sum(i). To ensure well-defined
answers, we require that Z[i] ≥ 0 for all i.
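
To make the three operations concrete, here is a naive sketch that implements them directly on a Python list. The class name is a hypothetical choice and each operation takes linear time; it only fixes the semantics (1-indexed positions, as in the text) and is not the word-RAM structure discussed below.

```python
from bisect import bisect_left
from itertools import accumulate

class NaivePartialSums:
    """Reference semantics only; Z[0] in the list holds the paper's Z[1]."""

    def __init__(self, values):
        self.Z = list(values)

    def sum(self, i):
        """Return Z[1] + ... + Z[i]."""
        return sum(self.Z[:i])

    def update(self, i, delta):
        """Set Z[i] = Z[i] + delta."""
        self.Z[i - 1] += delta

    def search(self, t):
        """Return the smallest i with sum(i) >= t, so sum(i - 1) < t <= sum(i).

        Assumes 1 <= t <= sum(s) and non-negative entries.
        """
        prefix = list(accumulate(self.Z))
        return bisect_left(prefix, t) + 1

ps = NaivePartialSums([3, 0, 5, 2])   # prefix sums: 3, 3, 8, 10
```

With this array, search(4) skips the zero entry at position 2 and lands on position 3, since sum(2) = 3 < 4 ≤ 8 = sum(3).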

The partial sums problem is a classic and well-studied problem }8i,;10jTn[20, [22ll23l[3il[36] . 
In our context, we consider the problem in the word RAM model, where each array entry
stores a w-bit integer and the element of the array can be changed by δ-bit integers, i.e., the
argument Δ can be stored in δ bits. In this setting, Pătraşcu and Demaine [34] gave a linear-space
data structure with Θ(log s / log(w/δ)) time per operation. They also gave a matching
lower bound.

We consider the following generalization supporting dynamic changes to the array. The
dynamic partial sums problem is to additionally support the following operations.

insert(i, Δ): insert a new entry in Z with value Δ before Z[i],

delete(i): delete the entry Z[i] of value at most Δ,

merge(i): replace entry Z[i] and Z[i + 1] with a new entry with value Z[i] + Z[i + 1],

divide(i, t): replace entry Z[i] by two new consecutive entries with value t and Z[i] - t,
respectively, where 0 ≤ t ≤ Z[i].
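
The four dynamic operations can be sketched on a plain Python list as follows. This is again only a naive model of the semantics with hypothetical function names and 1-indexed positions; note that merge and divide leave the total sum of Z unchanged.

```python
def insert(Z, i, delta):
    """Insert a new entry with value delta before Z[i]."""
    Z.insert(i - 1, delta)

def delete(Z, i):
    """Delete the entry Z[i]."""
    del Z[i - 1]

def merge(Z, i):
    """Replace Z[i] and Z[i + 1] by a single entry holding their sum."""
    Z[i - 1:i + 1] = [Z[i - 1] + Z[i]]

def divide(Z, i, t):
    """Replace Z[i] by two consecutive entries t and Z[i] - t."""
    if not 0 <= t <= Z[i - 1]:
        raise ValueError("t must satisfy 0 <= t <= Z[i]")
    Z[i - 1:i] = [t, Z[i - 1] - t]

Z = [4, 6, 1]
divide(Z, 2, 2)   # Z is now [4, 2, 4, 1]
merge(Z, 2)       # Z is back to [4, 6, 1]
```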

Hon et al. m and Navarro and Sadakane [33] presented optimal solutions for this problem 
in the case where the entries in Z are at most polylogarithmic in s (they did not explicitly 
consider the merge and divide operations).

Our Results We show the following improved result.

Theorem 2. Given an array of length s storing w-bit integers and parameter δ, such that
Δ < 2^δ, we can solve the dynamic partial sums problem supporting sum, update, search, insert,
delete, merge, and divide in linear space and O(log s / log(w/δ)) time per operation.

Note that this bound simultaneously matches the optimal time bound for the standard partial 
sums problem and supports storing arbitrary w-bit values in the entries of the array, i.e., the 
values we can handle in optimal time are exponentially larger than in the previous results. 

To achieve our bounds we extend the static solution by Pătraşcu and Demaine [34]. Their
solution is based on storing a sampled subset of representative elements of the array and
difference encoding the remaining elements. They pack multiple difference encoded elements in
words and then apply word-level parallelism to speed up the operations. To support insert and
delete the main challenge is to maintain the representative elements that now dynamically
move within the array. We show how to efficiently do this by combining a new representation
of representative elements with a recent result by Pătraşcu and Thorup [35]. Along the way
we also slightly simplify the original construction by Pătraşcu and Demaine [34].

1.3 Substring Concatenation 

Let R be a string of length r. A substring concatenation query on R takes two pairs of indices 
(i, j) and (i', j') and returns the start position in R of an occurrence of R[i, j]R[i', j'], or NO
if the string is not a substring of R. The substring concatenation problem is to preprocess R 
into a data structure that supports substring concatenation queries. 
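
For illustration, a substring concatenation query can be answered naively by building the concatenated string and scanning R. The sketch below does exactly that; the function name is hypothetical, `None` stands in for the answer NO, and the running time is far from the bounds discussed in this section.

```python
def substring_concat(R, i, j, i2, j2):
    """Return the 1-indexed start of an occurrence of R[i, j]R[i2, j2] in R, or None."""
    xy = R[i - 1:j] + R[i2 - 1:j2]
    pos = R.find(xy)
    return pos + 1 if pos >= 0 else None

R = "abracadabra"
hit = substring_concat(R, 1, 2, 3, 4)    # "ab" + "ra" = "abra" occurs at position 1
miss = substring_concat(R, 1, 2, 5, 5)   # "ab" + "c" = "abc" does not occur in R
```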

Amir et al. [2] gave a solution using O(r √(log r)) space with query time O(log log r), and
recently Gawrychowski et al. m showed how to solve the problem in O(r log r) space and
O(1) time.

Our Results We give the following improved bounds. 

Theorem 3. Given a string R of length r, the substring concatenation problem can be solved 
in either 

(i) O(r log^ε r) space and O(1) time, for any constant ε > 0, or

(ii) O(r) space and O(log log r) time.

Hence, Theorem 3(i) matches the previous O(1) time bound while reducing the space from
O(r log r) to O(r log^ε r) and Theorem 3(ii) achieves linear space while using O(log log r) time.
Plugging in the two solutions into our solution for dynamic relative compression leads to the
two branches of Theorem 1.

To achieve the bound in (i), the main idea is a new construction that efficiently combines
compact data structure for 1D range reporting [3] with the recent constant time weighted
level ancestor data structure for suffix trees m. The bound in (ii) follows as a simple implication
of another recent result for unrooted LCP queries [1] by some of the authors. The
substring concatenation problem is a key component in several solutions to the string indexing
for patterns with wildcards problem [HOES], where the goal is to preprocess a string T to
support pattern matching queries for patterns with wildcards. Plugging in Theorem 3(i) we
immediately obtain the following new bound for the problem.

Corollary 1. Let T be a string of length t. For any pattern string P of length p with k
wildcards, we can support pattern matching queries on T using O(t log^ε t) space and O(p + σ^k)
time for any constant ε > 0.

This improves the running time of the fastest linear space solution by a factor log log t at the cost
of increasing the space slightly by a factor log^ε t. See [28] for a detailed overview of the known
results.

1.4 Extensions 

Finally, we present two extensions of the dynamic relative compression problem. 

1.4.1 Dynamic Relative Compression with Access and Replace 

If we restrict the operations to access and replace we obtain the following improved bound. 

Theorem 4. Let R and S be a reference and source string of lengths r and N, respectively, 
and let n be the length of the optimal substring cover of S by R. Then, we can solve the 
dynamic relative compression problem supporting access and replace in 0(n + r) space and 
O(log log N) expected time.

This version of dynamic relative compression is a key component in the dynamic text and 
static pattern matching problem, where the goal is to efficiently maintain a set of occurrences 
of a pattern P in a text T that is dynamically updated by changing individual characters. 
Let p and t denote the lengths of P and T, respectively. Amir et al. [2] gave a data structure 
using O(t + p √(log p)) space which supports updates in O(log log p) time. The computational
bottleneck in the update operation is to update a substring cover of size O(p). Plugging in
the bounds from Theorem 4 we immediately obtain the following improved bound.

Corollary 2. Given a pattern P and text T of lengths p and t, respectively, we can solve the 
dynamic text and static pattern matching problem in O(t + p) space and O(log log p) expected
time per update. 

Hence, we match the previous time bound while improving the space to linear. 

1.4.2 Dynamic Relative Compression with Split and Concatenate 

We also consider maintaining a set of compressed strings under split and concatenate operations
(as in Alstrup et al. HI). Let R be a reference string and let S = {S_1, ..., S_k} be a set
of strings compressed relative to R. In addition to access, replace, insert and delete we also
define the following operations.

concat(i, j): add string S_i · S_j to S and remove S_i and S_j.

split(i, j): remove S_i from S and add S_i[1, j - 1] and S_i[j, |S_i|].

We obtain the following bounds. 

Theorem 5. Let R be a reference string of length r, let S = {S_1, ..., S_k} be a set of source
strings of total length N, and let n be the total length of the optimal substring covers of the
strings in S. Then, we can solve the dynamic relative compression problem supporting access,
replace, insert, delete, split, and concat,

(i) in space O(n + r) and time O(log n) for access and time O(log n + log log r) for replace,
insert, delete, split, and concat, or

(ii) in space O(n + r log^ε r) and time O(log n) for all operations.

Hence, compared to the bounds in Theorem 1 we only increase the time bounds by an additional
log log n factor.

2 Dynamic Relative Compression 

In this section we show how Theorems 2 and 3 lead to Theorem 1. The proofs of Theorems 2
and 3 appear in Section 3 and Section 4, respectively.

Let C = ((i_1, j_1), ..., (i_{|C|}, j_{|C|})) be the compressed representation of S. From now on, we
refer to C as the cover of S, and call each element (i_l, j_l) in C a block. Recall that a block
(i_l, j_l) refers to a substring R[i_l, j_l] of R. A cover C is maximal if concatenating any two
consecutive blocks (i_l, j_l), (i_{l+1}, j_{l+1}) in C yields a string that does not occur in R, i.e., the
string R[i_l, j_l]R[i_{l+1}, j_{l+1}] is not a substring of R. We need the following lemma.

Lemma 1. If C_MAX is a maximal cover and C is an arbitrary cover of S, then |C_MAX| ≤ 2|C| - 1.

Proof. In each block b of C there can start at most two blocks in C_MAX, because otherwise
two adjacent blocks in C_MAX would be entirely contained in the block b, contradicting the
maximality of C_MAX. Since the last blocks of both C and C_MAX end at the last position of S, a
contradiction of the maximality is already obtained when more than one block of C_MAX starts
in the last block of C. Hence, |C_MAX| ≤ 2|C| - 1. □

Recall that n is the size of an optimal cover of S with regards to R. The lemma implies that 
we can maintain a compression of size at most 2n — 1 by maintaining a maximal cover of S. 
The remainder of this section describes our data structure for maintaining and accessing such 
a cover. 

Initially, we can use the suffix tree of R to construct a maximal cover of S in O(N + r)
time by greedily matching the maximal prefix of the remaining part of S with any suffix of 
R. This guarantees that the blocks constitute a maximal cover of S. 
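
The greedy construction can be sketched without a suffix tree by repeatedly extending the current block while the extended string still occurs in R. The version below uses `str.find` and therefore does not achieve the O(N + r) bound; the function names are hypothetical, and it assumes that every character of S occurs in R.

```python
def greedy_cover(R, S):
    """Greedy maximal cover of S as 1-indexed inclusive blocks (i, j) into R."""
    cover, pos = [], 0
    while pos < len(S):
        start = R.find(S[pos])
        if start < 0:
            raise ValueError("character of S does not occur in R")
        length = 1
        # Extend the block while the longer prefix of the rest of S occurs in R.
        while pos + length < len(S):
            nxt = R.find(S[pos:pos + length + 1])
            if nxt < 0:
                break
            start, length = nxt, length + 1
        cover.append((start + 1, start + length))
        pos += length
    return cover

def is_maximal(R, cover):
    """True if no two consecutive blocks concatenate to a substring of R."""
    return all(R[a - 1:b] + R[c - 1:d] not in R
               for (a, b), (c, d) in zip(cover, cover[1:]))

R = "abracadabra"
S = "abrabadabra"
cover = greedy_cover(R, S)   # [(1, 4), (2, 2), (6, 11)]
```

The cover is maximal because a block followed by just the first character of the next block already fails to occur in R, otherwise the greedy step would have extended the block.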

2.1 Data Structure 

The high level idea for supporting the operations on S is to store the sequence of block lengths
(j_1 - i_1 + 1), ..., (j_{|C|} - i_{|C|} + 1) in a dynamic partial sums data structure. This allows us, for
example, to identify the block that encodes the k-th character in S by performing a search(k)
query.

Updates to S are implemented by splitting a block in C. This may break the maximality 
property so we use substring concatenation queries on R to detect if blocks can be merged. 
We only need a constant number of substring concatenation queries to restore maximality. 
To maintain the correct sequence of block lengths we use update, divide and merge operations 
on the dynamic partial sums data structure. 

Our data structure consists of the string R, a substring concatenation data structure of
Theorem 3 for R, a maximal cover C for S stored in a doubly linked list, and the dynamic
partial sums data structure of Theorem 2 storing the block lengths of C. We also store
auxiliary links between a block in the doubly linked list and the corresponding block length
in the partial sums data structure, and a list of alphabet symbols in R with the location of an
occurrence for each symbol. By Lemma 1 and since C is maximal we have |C| ≤ 2n - 1 = O(n).
Hence, the total space for C and the partial sums data structure is O(n). The space for R
is O(r) and the space for the substring concatenation data structure is either O(r) or O(r log^ε r)
depending on the choice in Theorem 3. Hence, in total we use either O(n + r) or O(n + r log^ε r)
space.

2.2 Answering Queries 

To answer access(i) queries we first compute search(i) in the dynamic partial sums structure
to identify the block b_l = (i_l, j_l) containing position i in S. The local index in R[i_l, j_l] of the
i-th character in S is ℓ = i - sum(l - 1), and thus the answer to the query is the character
R[i_l + ℓ - 1].
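
The index arithmetic of an access query can be traced with the following sketch. It recomputes a prefix-sum array on every call and uses binary search for search(i), so it only illustrates the computation, not the time bound; the function name and the example cover are assumptions.

```python
from bisect import bisect_left
from itertools import accumulate

def access(R, cover, i):
    """Return S[i] (1-indexed) for the string S encoded by cover over R."""
    lengths = [j - a + 1 for a, j in cover]
    prefix = list(accumulate(lengths))          # prefix[l - 1] equals sum(l)
    l = bisect_left(prefix, i) + 1              # search(i): block number, 1-indexed
    offset = i - (prefix[l - 2] if l > 1 else 0)  # local index: i - sum(l - 1)
    i_l, _ = cover[l - 1]
    return R[(i_l + offset - 1) - 1]            # R[i_l + offset - 1] in 0-indexed form

R = "abracadabra"
cover = [(5, 11), (1, 5)]     # encodes S = "cadabra" + "abrac"
S = "cadabraabrac"
```

For example, access(R, cover, 8) finds block 2 with local index 1 and returns R[1] = "a".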

We perform replace and delete by first identifying b_l = (i_l, j_l) and ℓ as above. Then we
partition b_l into three new blocks b_l^1 = (i_l, i_l + ℓ - 2), b_l^2 = (i_l + ℓ - 1, i_l + ℓ - 1), b_l^3 = (i_l + ℓ, j_l)
where b_l^2 is the single character block for index i in S that we must change. In replace we
change b_l^2 to an index of an occurrence in R of the new character (which we can find from the
list of alphabet symbols), while we remove b_l^2 in delete. The new blocks and their neighbors,
that is, b_{l-1}, b_l^1, b_l^2, b_l^3, and b_{l+1} may now be non-maximal. To restore maximality we perform
substring concatenation queries on each consecutive pair of these 5 blocks, and replace non-maximal
blocks with merged maximal blocks. All other blocks are still maximal, since the
strings obtained by concatenating b_{l'} with b_{l'+1}, for all l' < l - 1 and all l' > l, were not present
in R before the change and are not present afterwards. A similar idea is used by Amir et al. [2].
We perform update, divide and merge operations to maintain the corresponding lengths in
the dynamic partial sums data structure. The insert operation is similar, but inserts a new
single character block between two parts of b_l before restoring maximality. Observe that using
δ = O(1) bits in update is sufficient to maintain the correct block lengths.
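
The split-and-merge step for replace can be sketched as follows. The block number l and the local index are passed in directly (in the data structure they come from search and sum), the substring concatenation query is answered naively with `str.find`, and the cover is a Python list rather than a linked list; all names are hypothetical.

```python
def concat_query(R, x, y):
    """Naive substring concatenation: a block covering x followed by y, or None."""
    s = R[x[0] - 1:x[1]] + R[y[0] - 1:y[1]]
    p = R.find(s)
    return (p + 1, p + len(s)) if p >= 0 else None

def replace(R, cover, l, offset, a):
    """Replace the character at local index offset of block l (both 1-indexed) by a."""
    i_l, j_l = cover[l - 1]
    p = R.find(a)
    if p < 0:
        raise ValueError("character does not occur in R")
    # Split block l into three parts; the middle part is the new single character.
    parts = [(i_l, i_l + offset - 2), (p + 1, p + 1), (i_l + offset, j_l)]
    parts = [b for b in parts if b[0] <= b[1]]      # drop empty outer parts
    lo = max(l - 2, 0)
    window = cover[lo:l - 1] + parts + cover[l:l + 1]   # at most 5 blocks
    merged = [window[0]]
    for b in window[1:]:
        m = concat_query(R, merged[-1], b)
        if m is not None:
            merged[-1] = m          # consecutive blocks occur together in R
        else:
            merged.append(b)
    return cover[:lo] + merged + cover[l + 1:]

R = "abracadabra"
cover = [(1, 4), (2, 2), (6, 11)]          # S = "abra" + "b" + "adabra"
new_cover = replace(R, cover, 2, 1, "c")   # S becomes "abracadabra"
```

In this example both neighbouring pairs merge, so the cover shrinks from three blocks to the single block (1, 11); at most four concatenation queries are issued per update.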

In total, each operation requires a constant number of substring concatenation queries
and dynamic partial sums operations; the latter having time complexity O(log n / log(w/δ)) =
O(log n / log log n) as w ≥ log n and δ = O(1). Hence, the total time for each access, replace,
insert, and delete operation is either O(log n / log log n + log log r) or O(log n / log log n) depending
on the substring concatenation data structure used. In summary, this proves Theorem 1.

3 Dynamic Partial Sums 

In this section we prove Theorem 2. We support the operations insert(i, Δ) and delete(i) on
a sequence of w-bit integer keys by implementing them using update and a divide or merge
operation, respectively. This means that we support inserting or deleting keys with value at
most 2^δ.

We first solve the problem for small sequences. The general solution uses a standard 
reduction, storing Z at the leaves of a B-tree of large outdegree. We use the solution for small 
sequences to navigate in the internal nodes of the B-tree. 

Dynamic Integer Sets We need the following recent result due to Pătraşcu and Thorup [35]
on maintaining a set of integer keys X under insertions and deletions. The queries
are as follows, where q is an integer. The membership query member(q) returns true if q ∈ X,
predecessor pred_X(q) returns the largest key x ∈ X where x < q, and successor succ_X(q)
returns the smallest key x ∈ X where x ≥ q. The rank rank_X(q) returns the number of keys
in X smaller than q, and select(i) returns the i-th smallest key in X.

Lemma 2 (Pătraşcu and Thorup [35]). There is a data structure for maintaining a dynamic
set of w^{O(1)} w-bit integers that supports insert, delete, membership, predecessor, successor,
rank and select in constant time per operation.

3.1 Dynamic Partial Sums for Small Sequences 

Let Z be a sequence of at most B ≤ w^{O(1)} integer keys. We will show how to store Z in linear
space such that all dynamic partial sums operations can be performed in constant time. We
let Y be the sequence of prefix sums of Z, defined such that each key Y[i] is the sum of the
first i keys in Z, i.e., Y[i] = Σ_{j=1}^{i} Z[j]. Observe that sum(i) = Y[i] and search(t) is the index
of the successor of t in Y. Our goal is to store and maintain a representation of Y subject to
the dynamic operations update, divide and merge in constant time per operation.

3.1.1 The Scheme by Pătraşcu and Demaine

We first review the solution to the static partial sums problem by Pătraşcu and Demaine [34],
slightly simplified due to Lemma 2. Our dynamic solution builds on this.

The entire data structure is rebuilt every B operations as follows. We first partition Y
greedily into runs. Two adjacent elements in Y are in the same run if their difference is at
most B·2^δ, and we call the first element of each run a representative for all elements in the
run. We use ℛ to denote the sequence of representative values in Y and rep(i) to be the index
of the representative for element Y[i] among the elements in ℛ.

We store Y by splitting representatives and other elements into separate data structures:
X and ℛ store the representatives at the time of the last rebuild, while U stores each element
in Y as an offset to its representative value as well as updates since the last rebuild. We ensure
Y[i] = ℛ[rep(i)] + U[i] for any i and can thus reconstruct the values of Y.
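
The decomposition into runs, representatives and offsets at rebuild time can be sketched as follows. The parameter `gap` plays the role of B·2^δ, positions are 0-indexed for convenience, and plain Python lists stand in for the packed word U and the integer sets of Lemma 2; the function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

def build_runs(Z, gap):
    """Split the prefix sums Y of Z into runs; return (Y, X, reps, U)."""
    Y = list(accumulate(Z))
    X, reps, U = [], [], []
    for i, y in enumerate(Y):
        # A new run starts when the difference to the previous element exceeds gap.
        if i == 0 or y - Y[i - 1] > gap:
            X.append(i)       # index of the representative in Y
            reps.append(y)    # value of the representative
        U.append(y - reps[-1])  # offset of Y[i] from its representative
    return Y, X, reps, U

def rep(X, i):
    """Index, among the representatives, of the representative of position i."""
    return bisect_right(X, i) - 1

Y, X, reps, U = build_runs([5, 1, 2, 20, 3, 30, 1], gap=4)
```

Here Y = [5, 6, 8, 28, 31, 61, 62] splits into three runs with representatives 5, 28 and 61, and every Y[i] is recovered as reps[rep(X, i)] + U[i].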

The representatives are stored as follows. X is the sequence of indices in Y of the representatives
and ℛ is the sequence of representative values in Y. Both X and ℛ are stored using
the data structure of Lemma 2. We can then define rep(i) = rank_X(pred_X(i)) as the index of
the representative for i among all representatives, and use ℛ[rep(i)] = select_ℛ(rep(i)) to get
the value of the representative for i.

We store in U the current difference from each element to its representative, U[i] = Y[i] -
ℛ[rep(i)] (i.e. updates between rebuilds are applied to U). The idea is to pack U into a single
word of B elements. Observe that update(i, Δ) adds value Δ to all elements in Y with index at
least i. We can support this operation in constant time by adding to U a word that encodes
Δ for those elements. Since each difference between adjacent elements in a run is at most
B·2^δ and |Y| = O(B), the maximum value in U after a rebuild is O(B^2 · 2^δ). As B updates of
size 2^δ may be applied before a rebuild, the changed value at each element due to updates
is O(B·2^δ). So each element in U requires O(log B + δ) bits (including an overflow bit per
element). Thus, U requires O(B(log B + δ)) bits in total and can be packed in a single word
for B = O(min{w / log w, w/δ}).

Between rebuilds the stored representatives are potentially outdated because updates may
have changed their values. However, observe that the values of two consecutive representatives
differ by more than B·2^δ at the time of a rebuild, so the gap between two representatives cannot
be closed by B updates of δ bits each (before the structure is rebuilt again). Hence, an answer
to search(t) cannot drift much from the values stored by the representatives; it can only be
in a constant number of runs, namely those with a representative value succ_ℛ(t) and its two
neighboring runs. In a run with representative value v, we find the smallest j (inside the run)
such that U[j] + v - t ≥ 0. The smallest j found in all three runs is the answer to the search(t)
query. Thus, by rebuilding periodically, we only need to check a constant number of runs
when answering a search(t) query.

On this structure, Pătraşcu and Demaine [34] show that the operations sum, search and
update can be supported in constant time each as follows:

sum(i): return the sum of ℛ[rep(i)] and U[i]. This takes constant time as U[i] is a field in a
word and representatives are stored using Lemma 2.

search(t): let r_0 = rank_ℛ(succ_ℛ(t)). We must find the smallest j such that U[j] + ℛ[r] - t ≥ 0
for r ∈ {r_0 - 1, r_0, r_0 + 1}, where j is in run r. We do this for each r using standard word
operations in constant time by adding ℛ[r] - t to all elements in U, masking elements
not in the run (outside indices select_X(r) to select_X(r + 1) - 1), and counting the number
of negative elements.

update(i, Δ): we do this in constant time by copying Δ to all fields j ≥ i by a multiplication
and adding the result to U.
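
The multiplication trick behind update can be illustrated with Python integers standing in for machine words. In this sketch fields are `f` bits wide and numbered from 0 at the least significant end; it assumes non-negative Δ and that no field overflows, and the helper names are hypothetical.

```python
def pack(values, f):
    """Pack non-negative integers into one word, f bits per field."""
    word = 0
    for j, v in enumerate(values):
        word |= v << (j * f)
    return word

def unpack(word, f, count):
    """Read back count fields of f bits each."""
    mask = (1 << f) - 1
    return [(word >> (j * f)) & mask for j in range(count)]

def update(word, f, count, i, delta):
    """Add delta to every field j >= i using one multiplication and one addition."""
    # ones has a 1 in the lowest bit of every field: 1 + 2^f + 2^(2f) + ...
    ones = ((1 << (count * f)) - 1) // ((1 << f) - 1)
    selector = (ones >> (i * f)) << (i * f)   # clear the fields below i
    return word + selector * delta            # delta copied into fields >= i

U = pack([0, 1, 3, 0, 3], f=8)
U2 = update(U, f=8, count=5, i=2, delta=5)    # fields become [0, 1, 8, 5, 8]
```

On a real word RAM the constant `ones` would be precomputed, so the update costs a shift, a multiplication and an addition regardless of how many fields are touched.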

To count the number of negative elements or find the least significant bit in a word in constant 
time, we use the technique by Fredman and Willard [El- 

Notice that rebuilding the data structure every B operations takes O(B) time, resulting 
in amortized constant time per operation. We can instead do this incrementally by a standard 
approach by Dietz [8], reducing the time per operation to worst case constant. The idea is to 
construct the new replacement data structure incrementally while using the old and complete 
data structure. 

3.1.2 Efficient Support for divide and merge 

We now show how to maintain the structure described above while supporting operations 
divide(i, t) and merge(i). An example supporting the following explanation is provided in
Figure 1.

Observe that the operations are only local: splitting Z[i] into two parts or merging Z[i]
and Z[i + 1] does not influence the precomputed values in Y (besides adding/removing values
for the divided/merged elements). We must update X, ℛ and U to reflect these local
changes accordingly. Because a divide or merge operation may create new representatives between
rebuilds with values that do not fit in U, we change X, ℛ and U to reflect these new
representatives by rebuilding the data structure locally. This is done as follows.

Consider the run representatives. Both divide(i, t) and merge(i) may require us to create a
new run, combine two existing runs or remove a run. In any case, we can find a replacement
representative for each run affected. As the operations are only local, the replacement is either
a divided or merged element, or one of the neighbors of the replaced representative. Replacing
representatives may cause both indices and values for the stored representatives to change.
We use insertions and deletions on ℛ to update representative values.

[Figure 1, image not reproduced. Panels: a) the initial data structure constructed from Z;
b) the result of divide(8, 3) on the structure of a), where representative value 30 was removed
from ℛ, and U, B and C were shifted and updated to remove the old representative and
accommodate a new element with value 2; c) the result of merge(12) on the structure of b).]

Figure 1: Illustrating operations on the data structure with B·2^δ = 4. a) shows the data
structure immediately after a rebuild, b) shows the result of performing divide(8, 3) on the
structure of a), and c) shows the result of performing merge(12) on the structure of b).

Since the new operations change the indices of the elements, these changes must also be
reflected in X. For example, a merge(i) operation decrements the indices of all elements with
index larger than i compared to the indices stored at the time of the last rebuild. We should
in principle adjust the O(B) changed indices stored in X. The cost of adjusting the indices
accordingly when using Lemma 2 to store X is O(B). Instead, to get our desired constant time
bounds, we represent X using a resizable data structure with the same number of elements
as Y that supports this kind of update. We must support select_X(i), rank_X(q), and pred_X(q)
as well as inserting and deleting elements in constant time. Because X has few and small
elements, we can support the operations in constant time by representing it using a bitstring
B and a structure C which is the prefix sum over B as follows.

Let B be a bitstring of length |Y| ≤ B, where B[i] = 1 iff there is a representative at index
i. C has |Y| elements, where C[i] is the prefix sum of B including element i. Since C requires
O(B log B) bits in total we can pack it in a single word. We answer queries as follows: rank_X(q)
equals C[q - 1], we answer select_X(i) by subtracting i from all elements in C and return one
plus the number of elements smaller than 0 (as done in U when answering search), and we find
pred_X(q) as the index of the least significant bit in B after having masked all indices larger
than q. Updates are performed as follows. Using mask, shift and concatenate operations, we
can ensure that B and C have the same size as Y at all times (we extend and shrink them
when performing divide and merge operations). Inserting or deleting a representative is to set
a bit in B, and to keep C up to date, we employ the same ±1 update operation as used in U.
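
The three queries on the pair (B, C) can be traced with ordinary lists, as in the sketch below. It mirrors the formulas of the paragraph above with 1-indexed positions (index 0 is unused padding), but it loops over the elements instead of using packed words, and it assumes that pred is taken over indices up to and including q, as the masking description suggests. The function names are hypothetical.

```python
from itertools import accumulate

def build(indices, size):
    """Bit vector B (B[0] unused) and prefix counts C with C[k] = B[1] + ... + B[k]."""
    B = [0] * (size + 1)
    for x in indices:
        B[x] = 1
    C = list(accumulate(B))
    return B, C

def rank(C, q):
    """Number of representative indices smaller than q, i.e. C[q - 1]."""
    return C[q - 1]

def select(C, i):
    """Index of the i-th smallest representative: one plus the count of C[k] - i < 0."""
    return 1 + sum(1 for k in range(1, len(C)) if C[k] - i < 0)

def pred(B, q):
    """Largest representative index that is at most q (indices above q are masked out)."""
    return max(k for k in range(1, q + 1) if B[k])

B, C = build([1, 4, 6], size=7)   # representatives at indices 1, 4 and 6
```

With these values C = [0, 1, 1, 1, 2, 2, 3, 3], so select(C, 2) counts three entries below 2 and returns index 4.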

We finally need to adjust the relative offsets of all elements with a changed representative
in U (since they now belong to a representative with a different value). In particular, if the
representative for $U[j]$ changed value from $v$ to $v'$, we must subtract $v' - v$ from $U[j]$. This
can be done for all affected elements belonging to a single representative simultaneously in U
by a single addition with an appropriate bitmask (updating a range of U). Note that we know the
range of elements to update from the representative indices. Finally, we may need to insert or
delete an element in U, which can be done easily by mask, shift and concatenate operations
on the word U. This leads to Theorem 6.
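The range update by a single addition can be sketched as follows. This is only an illustration under stated assumptions: the helper names `pack`, `unpack` and `add_to_range` are hypothetical, Python integers stand in for a machine word, fields are numbered from 0, and every field is assumed to stay within `0 .. 2**f - 1` after the update so that no carry or borrow crosses a field boundary.

```python
def pack(values, f):
    """Pack non-negative integers into one word, f bits per field."""
    word = 0
    for p, v in enumerate(values):
        word |= v << (p * f)
    return word


def unpack(word, f, count):
    """Inverse of pack: read back `count` fields of f bits each."""
    mask = (1 << f) - 1
    return [(word >> (p * f)) & mask for p in range(count)]


def add_to_range(word, f, count, lo, hi, delta):
    """Add delta to fields lo..hi (inclusive) using one multiplication and one addition."""
    low_bits = pack([1] * count, f)   # constant with a 1 in each field; precomputable
    span = ((1 << ((hi + 1) * f)) - 1) ^ ((1 << (lo * f)) - 1)  # bits of fields lo..hi
    return word + delta * (low_bits & span)
```

With 8-bit fields holding 5, 7, 9, 4, calling `add_to_range(word, 8, 4, 1, 2, -3)` changes only the two middle fields.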

Theorem 6. There is a linear space data structure for dynamic partial sums supporting
each operation search, sum, update, insert, delete, divide, and merge on a sequence of length
$O(\min\{w/\log w, w/\delta\})$ in worst-case constant time.

3.2 Dynamic Partial Sums for Large Sequences 

Willard [43] (and implicitly Dietz [8]) showed that a leaf-oriented B-tree with out-degree B
of height h can be maintained in $O(h)$ worst-case time if: 1) searches, insertions and deletions
take $O(1)$ time per node when no splits or merges occur, and 2) merging or splitting a node
of size B requires $O(B)$ time.

We use this as follows, where Z is our integer sequence of length s. Create a leaf-oriented
B-tree of degree $B = \Theta(\min\{w/\log w, w/\delta\})$ storing Z in the leaves, with height
$h = O(\log_B n) = O(\log n / \log(w/\delta))$. Each node v uses Theorem 6 to store the $O(B)$ sums
of leaves in each of the subtrees of its children. Searching for t in a node corresponds to
finding the successor $Y[i]$ of t among these sums. Dividing or merging elements in Z corresponds
to inserting or deleting a leaf. This concludes the proof of Theorem 2.
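To make the tree layout concrete, here is a minimal static sketch in Python. It is not the data structure of Theorem 2: the class name `PartialSumsTree`, the level-by-level array layout and the 0-based indexing are assumptions for illustration, insert, delete, divide and merge are not supported, and the per-node work is done by plain loops where the paper would use the constant-time structure of Theorem 6.

```python
class PartialSumsTree:
    """Degree-B tree over Z: levels[0] is Z, levels[k][v] sums B entries of levels[k-1]."""

    def __init__(self, Z, B=4):
        self.B = B
        self.levels = [list(Z)]
        while len(self.levels[-1]) > 1:
            prev = self.levels[-1]
            self.levels.append([sum(prev[j:j + B]) for j in range(0, len(prev), B)])

    def update(self, i, delta):
        """Add delta to Z[i]; exactly one stored sum changes per level."""
        for level in self.levels:
            level[i] += delta
            i //= self.B

    def prefix_sum(self, i):
        """Return Z[0] + ... + Z[i]."""
        total, end = 0, i + 1
        for level in self.levels:
            first = end - end % self.B       # start of the partially covered node
            total += sum(level[first:end])   # per-node work that Theorem 6 makes O(1)
            end //= self.B
        return total

    def search(self, t):
        """Smallest i with prefix_sum(i) >= t, assuming 1 <= t <= sum of Z."""
        v = 0
        for level in reversed(self.levels[:-1]):
            v *= self.B                      # first child of the current node
            while level[v] < t:              # successor search among the child sums
                t -= level[v]
                v += 1
        return v
```

For example, with `Z = [3, 1, 4, 1, 5, 9, 2, 6]` the prefix sums are 3, 4, 8, 9, 14, and so on, so searching for 10 ends at index 4.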

4 Substring Concatenation


In this section we prove Theorem 3. Recall that we must store a string R subject to substring
concatenation queries: given two strings x and y, return the location of an occurrence of xy
in R, or NO if no such occurrence exists.

To prove (i) we need the following definitions. For a substring x of R, let $S(x)$ denote the
suffixes of R that have x as a prefix, and let
$S'(x) = \{\, i + |x| \mid i \in S(x) \wedge i + |x| \le n \,\}$, i.e., $S'(x)$ are the suffixes of R
that are immediately preceded by x. Hence for two substrings x and y, the suffixes that have y
as a prefix and are immediately preceded by x, i.e., the occurrences of xy, are given exactly by
$S'(x) \cap S(y)$. We can reduce this intersection problem to a 1D range emptiness problem as
follows.

Let $\mathrm{rank}(i)$ be the position of suffix $R[i..r]$ in the lexicographic ordering of all suffixes
of R, and let $\mathrm{rank}(A) = \{\mathrm{rank}(i) \mid i \in A\}$ for $A \subseteq \{1, \ldots, n\}$. Then xy is a
substring of R if and only if $\mathrm{rank}(S'(x)) \cap \mathrm{rank}(S(y)) \neq \emptyset$. Note that
$\mathrm{rank}(S(y))$ is a range $[a, b] \subseteq [1, n]$, and we can determine this range in constant time
for any substring y using a constant-time weighted ancestor query on the suffix tree of R US].
Consequently, we can decide if xy is a substring of R by a 1D range emptiness query on the set
$\mathrm{rank}(S'(x))$.
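The reduction can be checked on small inputs with a naive Python sketch. Everything here is an assumption for illustration: the function names are hypothetical, indices are 0-based, the suffix ranks are computed by plain sorting, $S(y)$ is found by scanning R instead of a weighted ancestor query, and the emptiness test is a linear scan rather than a constant-time structure. Both x and y are assumed to be non-empty.

```python
def suffix_ranks(R):
    """rank[i] = position of suffix R[i:] in lexicographic order (naive construction)."""
    order = sorted(range(len(R)), key=lambda i: R[i:])
    rank = [0] * len(R)
    for pos, i in enumerate(order):
        rank[i] = pos
    return rank


def substring_concatenation(R, x, y):
    """Return a start position of xy in R, or None if xy does not occur."""
    rank = suffix_ranks(R)
    S_y = [i for i in range(len(R)) if R.startswith(y, i)]
    if not S_y:
        return None
    # The ranks of the suffixes prefixed by y form one contiguous range [a, b].
    a = min(rank[i] for i in S_y)
    b = max(rank[i] for i in S_y)
    # S'(x): suffixes that are immediately preceded by an occurrence of x.
    S_prime_x = [i + len(x) for i in range(len(R))
                 if R.startswith(x, i) and i + len(x) < len(R)]
    for j in S_prime_x:              # range emptiness query on rank(S'(x))
        if a <= rank[j] <= b:
            return j - len(x)
    return None
```

On `R = "banana"`, the query with `x = "an"` and `y = "ana"` finds the suffix at position 3, whose rank lies in the range of `"ana"`, and reports the occurrence starting at position 1.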

Belazzougui et al. [3] (see also m) recently gave a 1D range emptiness data structure
for a set $A \subseteq [1, r]$ using $O(|A| \log^{\varepsilon} r)$ bits of space, for any constant
$\varepsilon > 0$, and answering queries in constant time. We will build this data structure for
$\mathrm{rank}(S'(x))$, but doing so for all substrings would require space $\Omega(r^2)$.

To arrive at the space bound of $O(r \log^{\varepsilon} r)$ (words), we employ a heavy path
decomposition m on the suffix tree of R, and only build the data structure for substrings of R
that correspond to the top of a heavy path. In this way, each suffix will appear in at most
$\log r$ such data structures, leading to the claimed $O(r \log^{\varepsilon} r)$ space bound (in
words). In addition, we build an $O(r)$-space nearest common ancestor data structure fT§ l for the
suffix tree of R. Constant-time nearest common ancestor queries will allow us to also answer
longest common prefix queries on R in constant time.
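The space bound can be checked with a short calculation. The sketch below assumes the standard word size $w \ge \log r$ (an assumption not restated in this section) and writes $P$ for the set of substrings that correspond to tops of heavy paths, a symbol introduced here only for illustration. Since each suffix appears in at most $\log r$ of the sets $S'(x)$ with $x \in P$:

```latex
\sum_{x \in P} |S'(x)| \;\le\; r \log r
\quad\Longrightarrow\quad
\sum_{x \in P} O\bigl(|S'(x)| \log^{\varepsilon} r\bigr) \text{ bits}
\;=\; O\bigl(r \log r \cdot \log^{\varepsilon} r\bigr) \text{ bits}
\;=\; O\bigl(r \log^{\varepsilon} r\bigr) \text{ words}.
```

The last step divides the bit count by the word size $w \ge \log r$.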

To answer a substring concatenation query with substrings x and y, we first determine
how far y follows the heavy path in the suffix tree from the location where x stops. This can
be done in $O(1)$ time by a constant-time longest common prefix query between two suffixes
of R. We then proceed to the top of the next heavy path, where we query the 1D range
reporting data structure with the range $\mathrm{rank}(S(y'))$, where $y'$ is the remaining unmatched
suffix of y. This completes the query, and the proof of (i).

The second solution (ii) is an implication of a result by Bille et al. [3]. Given the suffix
tree $ST_R$ of R, an unrooted longest common prefix query [6] takes a suffix y and a location
$\ell$ in $ST_R$ (either a node or a position on an edge) and returns the location in $ST_R$ that is
reached after matching y starting from location $\ell$. A substring concatenation query is
straightforward to implement using two unrooted longest common prefix queries, the first one
starting at the root, and the second starting from the location returned by the first query. It
follows from Bille et al. [3] that we can build a linear space data structure that supports
unrooted longest common prefix queries in time $O(\log\log r)$, thus completing the proof of (ii).

5 Extensions 

In this section we show how to solve two other variants of the dynamic relative compression
problem. We first prove Theorem 4, showing how to improve the query time if only supporting
operations access and replace. We then show Theorem 5, generalising the problem to support
multiple strings. These data structures use the same substring concatenation data structure
of Theorem 3 as before but replace the dynamic partial sums data structure.

5.1 Dynamic Relative Compression with Access and Replace 

In this setting we constrain the operations on S to access(i) and replace(i, $\alpha$). Then,
instead of maintaining a dynamic partial sums data structure over the lengths of the substrings
in C, we only need a dynamic predecessor data structure over the prefix sums. The operations
are implemented as before, except that for access(i) we obtain block $b_j$ by computing the
predecessor of i in the predecessor data structure, which also immediately gives us access to
the local index in $b_j$. For replace(i, $\alpha$), a constant number of updates to the predecessor
data structure is needed to reflect the changes. We use substring concatenation queries to
restore maximality as described in Section 2. The prefix sums of the subsequent blocks in C are
preserved since $|b_j| = |b_j^1| + |b_j^2| + |b_j^3|$, i.e., the blocks that replace $b_j$ have the same
total length as $b_j$.
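A minimal Python sketch of this variant is given below. It is an illustration under several assumptions: the class name `AccessReplace` and its methods are hypothetical, blocks are (position in R, length) pairs with 0-based indices, a sorted Python list with `bisect` stands in for the predecessor data structure (so updates are not as fast as claimed for the real structure), the new character is assumed to occur in R, and maximality is not restored.

```python
import bisect


class AccessReplace:
    """Cover of S stored as blocks, with the block start positions kept sorted."""

    def __init__(self, R, blocks):
        self.R = R
        self.blocks = list(blocks)          # (position in R, length) pairs
        self.starts = []                    # prefix sums: start of each block in S
        pos = 0
        for _, length in self.blocks:
            self.starts.append(pos)
            pos += length
        self.N = pos

    def access(self, i):
        j = bisect.bisect_right(self.starts, i) - 1   # predecessor of i among the starts
        p, _ = self.blocks[j]
        return self.R[p + i - self.starts[j]]         # local index inside block j

    def replace(self, i, ch):
        j = bisect.bisect_right(self.starts, i) - 1
        p, length = self.blocks[j]
        off = i - self.starts[j]
        # Split block j into a left part, a single character, and a right part.
        pieces = [(p, off), (self.R.index(ch), 1), (p + off + 1, length - off - 1)]
        new_blocks, new_starts, s = [], [], self.starts[j]
        for q, l in pieces:
            if l > 0:
                new_blocks.append((q, l))
                new_starts.append(s)
                s += l
        # Starts of all later blocks are unchanged: the piece lengths add up to `length`.
        self.blocks[j:j + 1] = new_blocks
        self.starts[j:j + 1] = new_starts

    def text(self):
        return "".join(self.access(i) for i in range(self.N))
```

With `R = "abcde"` and blocks `[(0, 3), (2, 3)]` the structure represents `"abccde"`, and replacing position 4 by `"a"` splits the second block into three single-character blocks.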

With a linear space implementation of the van Emde Boas data structure [31, CL 42] we
can support the predecessor queries and updates in $O(\log\log N)$ expected time. For substring
concatenation we apply Theorem 3(ii), using $O(r)$ space and $O(\log\log r)$ query time. Since the
length of the source string does not change, we can always assume that $r \le N$, and the total
time becomes $O(\log\log N + \log\log r) = O(\log\log N)$. In summary, this proves Theorem 4.

5.2 Dynamic Relative Compression with Split and Concatenate 

Consider the variant of the dynamic relative compression problem where we want to maintain
a relative compression of a set of strings $S_1, \ldots, S_k$. Each string $S_i$ has a cover $C_i$, and
all strings are compressed relative to the same string R. In this setting
$n = \sum_{i=1}^{k} |C_i|$. In addition to the operations access, replace, insert, and delete, we
also want to support split and concatenation of strings. Note that the semantics of the
operations change to indicate the string(s) to perform a given operation on.

We build a leaf-oriented height-balanced binary tree $T_i$ (e.g. an AVL tree or red-black
tree) over the blocks $C_i[1], \ldots, C_i[|C_i|]$ for each string $S_i$. In each internal node v, we
store the sum of the block sizes represented by its leaves. Since the total number of blocks is
n, the trees use $O(n)$ space. All operations rely on the standard procedures for searching,
inserting, deleting, splitting and joining height-balanced binary trees. All of these run in
$O(\log n)$ time for a tree of size n. See for example [7] for details on how red-black trees
achieve this.

The answer to an access(i, j) query is found by doing a top-down search in $T_i$ using the
sums of block sizes to navigate. Since the tree is balanced and the size of the cover is at most
n, this takes $O(\log n)$ time. The operations replace(i, j, $\alpha$), insert(i, j, $\alpha$), and
delete(i, j) all initially require that we use access(i, j) to locate the block containing the
j-th character of $S_i$. To reflect possible changes to the blocks of the cover, we need to modify
the corresponding tree to contain more leaves and restore the balancing property. Since the
number of nodes added to the tree is constant, these operations each take $O(\log n)$ time. The
concat(i, j) operation requires that we join two trees in the standard way and restore the
balancing property of the resulting tree. For the split(i, j) operation we first split the block
that contains position j such that the j-th character is the trailing character of a block. We
then split the tree into two trees separated by the new block. This takes $O(\log n)$ time for a
height-balanced tree.
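The top-down search for access can be sketched in Python as follows. This is a simplified, static illustration: the names `Node`, `build` and `access` are hypothetical, blocks are (position in R, length) pairs with 0-based indices, the tree is built once by halving the block list, and the rebalancing, split and join procedures of AVL or red-black trees are not shown.

```python
class Node:
    """Leaf-oriented binary tree node; leaves carry a block, internal nodes a size."""

    def __init__(self, left=None, right=None, block=None):
        self.left, self.right, self.block = left, right, block
        # size = total length of the blocks stored in the leaves below this node
        self.size = block[1] if block is not None else left.size + right.size


def build(blocks):
    """Balanced tree over a non-empty list of (position in R, length) blocks."""
    if len(blocks) == 1:
        return Node(block=blocks[0])
    mid = len(blocks) // 2
    return Node(build(blocks[:mid]), build(blocks[mid:]))


def access(root, R, j):
    """Return the j-th character (0-based) of the string represented by root."""
    node = root
    while node.block is None:
        if j < node.left.size:          # position j lies in the left subtree
            node = node.left
        else:                           # skip the left subtree and adjust j
            j -= node.left.size
            node = node.right
    return R[node.block[0] + j]
```

With `R = "abcde"` and blocks `[(0, 3), (2, 3), (1, 2)]`, the tree represents the string `"abccdebc"`, and each access follows one root-to-leaf path.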

To finalize the implementation of the operations, we must restore the maximality property
of the affected covers as described in Section 2. At most a constant number of blocks are
non-maximal as a result of any of the operations. If two blocks can be combined to one, we delete
the leaf that represents the rightmost block, update the leftmost block to reflect the change,
and restore the property that the tree is balanced. If the tree subsequently contains an internal
node with only one child, we delete it and restore the balancing. Again, this takes $O(\log n)$
time for balanced trees, which concludes the proof of Theorem 5.
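The merging step can be illustrated on a plain list of blocks. The Python sketch below is an assumption-laden stand-in and not the paper's procedure: the function name `restore_maximality` is hypothetical, blocks are (position in R, length) pairs with 0-based indices, `str.find` replaces the substring concatenation query of Theorem 3, and the whole cover is scanned instead of only the constant number of affected blocks.

```python
def restore_maximality(R, blocks):
    """Merge adjacent (position, length) blocks whose concatenation occurs in R."""
    merged = []
    for p, l in blocks:
        if merged:
            q, m = merged[-1]
            # Stand-in for a substring concatenation query on the two blocks.
            hit = R.find(R[q:q + m] + R[p:p + l])
            if hit != -1:
                merged[-1] = (hit, m + l)   # the left block absorbs the right one
                continue
        merged.append((p, l))
    return merged
```

After the scan, no two adjacent blocks can be merged: a pair that failed to merge stays unmergeable when its right block grows, because the longer concatenation contains the shorter one.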

6 Conclusion 

We have shown how to compress a text relative to a reference string while supporting access
to the text and a range of dynamic operations under some strong guarantees for the space
usage and the query times. There is, however, room for improvement.

Our solution to DRC is built on data structures for the partial sums problem and the
substring concatenation problem. Our partial sums solution is optimal, but in order to get
the desired constant query time for substring concatenation, our data structure uses
$O(r \log^{\varepsilon} r)$ space. As opposed to this, our linear space solution leads to
$O(\log\log r)$ query time. We leave as an open problem if it is possible to get $O(1)$ time
substring concatenation queries using $O(r)$ space, which will also carry over to a stronger
result for the DRC problem.

Moreover, the size of the cover that is maintained by our DRC data structure is also 
an interesting parameter. Currently we maintain a 2-approximation of the optimal cover. It 
would be useful to know if a better approximation ratio can be maintained under the same 
(or better) time and space bounds that we give. 